A multi-information fusion anomaly detection model based on convolutional neural networks and AutoEncoder

Network traffic anomaly detection is an effective analysis method for network security: it distinguishes different kinds of traffic and supports secure operation in complex and changing network environments. To avoid the information loss caused by processing traffic data while improving detection performance, this paper proposes a multi-information fusion model based on a convolutional neural network and an AutoEncoder. The model uses the convolutional neural network to extract features directly from the raw traffic data, and the AutoEncoder to encode statistical features extracted from the raw traffic data, which compensate for the information lost due to cropping. The two sets of features are combined to form a new integrated feature for network traffic, which carries both the load information from the original traffic data and the global information obtained from the statistical features, thus providing a more complete representation of the information contained in the network traffic and improving the detection performance of the model. Experiments show that the classification accuracy of network traffic anomaly detection using this model exceeds that of classical machine learning methods.

significant results in their respective domains, they mostly focus on single or specific types of data anomalies. Further exploration and improvement are required in handling the integrity of raw data and addressing the complexities of network traffic anomaly detection.
Although numerous existing studies have combined intelligent models such as machine learning and deep learning, leading to improvements in various detection methods and metrics, there is a lack of research on how the partial information loss caused by cropping raw traffic data affects detection. This has hindered the development of effective information models to enhance detection accuracy and performance. To address this gap, this paper proposes a multi-information fusion model based on convolutional neural networks (CNNs) and an AutoEncoder (AE). The model uses a CNN to extract information from the raw traffic data and an AutoEncoder to obtain compressed global information from the statistical features. By merging the high-level information extracted from the raw traffic data with the global information extracted from the statistical features, a new comprehensive feature is formed. This comprehensive feature is then learned by a neural network to obtain a more complete traffic feature representation, ultimately improving the model's detection performance.
The main contributions of this paper are as follows:
1. To address the issue of information loss during the cropping of raw traffic data, a multi-information fusion model is proposed. By leveraging the fusion characteristics of different models, this approach enables comprehensive acquisition of network traffic information, thereby effectively enhancing the performance of anomaly detection in network traffic.
2. The detection gain of multiple information sources is obtained by considering the information classification and recognition abilities of the CNN, as well as the feature extraction and data reconstruction characteristics of the AutoEncoder. By fully utilizing their respective advantages for local feature learning and statistical feature extraction, richer feature representations are achieved through model fusion, ultimately improving the classification accuracy of the model.
3. The model's robustness and generalization ability are improved, enabling it to perform well across various datasets and scenarios. By leveraging the distinct characteristics and learning abilities of the CNN and the AutoEncoder, their fusion compensates for their individual shortcomings and reduces the bias and variance of the model. Consequently, the stability, reliability, and decision-making capability of the anomaly detection model are enhanced.

Model architecture
The multi-information fusion model based on convolutional neural networks and AutoEncoder (MF-CA) is shown in Fig. 1. The model structure consists of three main parts: a pre-processing module, a traffic information extraction module, and a classification module. In the first part, the pre-processing module, the original pcap traffic is cropped and sliced. In the second part, traffic information extraction, a convolutional neural network is used to extract high-level features from the original network traffic data, and statistical features are also calculated from the original traffic. An AutoEncoder is then used to compress the extracted statistical features. The statistical features do not have to be changed for different tasks; they only need to be designed once. They are mainly used to compensate for the loss of global information due to traffic cropping. Afterwards, the features extracted by the convolutional neural network are fused with the features obtained by AutoEncoder compression to obtain the combined features. The third part feeds the fused features into a neural network for classification.

Pre-processing module
Because the input to the neural network has a specified format requirement, the raw data needs to be preprocessed first. Traffic packets captured in the network are usually stored in pcap format, with the content represented as hexadecimal-encoded data that can be transformed into IPs, ports and other content familiar to security analysts. As shown in Fig. 2, the pre-processing in this paper is divided into three steps, namely "traffic slicing", "traffic cleaning" and "statistical feature extraction".

Slicing
In the resulting streams, the number of packets per stream is not the same and varies considerably over a range of timestamps. We therefore do not use all the packets in a stream. Instead, the first 10 packets in each stream are selected for feature learning.

2. Traffic cleaning: Each packet in the original file contains Ethernet layer, network layer, transport layer, and application layer structures. Among them, the MAC source address, MAC destination address and protocol version in the Ethernet layer do not change frequently within the same LAN, and these fields do not contribute to the detection performance of the model in the same network. Therefore, these fields are not used as flow information in this paper. Similarly, the version and differentiated services fields in the network layer remain almost unchanged in the same network, so we remove these two fields as well.
We find that most of the streams have fewer than 10 packets, with a small number of streams having more than 10 packets. Since the size of the payload in each packet generally differs, to make full use of the information in these packets, we select the first 160 bytes of each packet as the packet feature, padding with zeros if the packet is shorter than 160 bytes and cropping if it is longer. In the end, a total of 1600 dimensions of raw data is extracted for each stream.

3. Statistical feature extraction: 26 statistical features are extracted for each stream, for example the number of packets the stream contains and the duration from the first packet to the last packet in the stream.
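The cropping rule described above (first 10 packets, first 160 bytes of each, zero padding) can be illustrated with a minimal sketch. The function names `flow_to_vector` and `basic_flow_stats` are hypothetical, packets are assumed to be already cleaned byte strings, and only three of the 26 statistical features are shown; this is not the authors' implementation.

```python
from typing import Dict, List, Sequence

PACKETS_PER_FLOW = 10   # first 10 packets of each flow (from the text)
BYTES_PER_PACKET = 160  # first 160 bytes of each packet (from the text)


def flow_to_vector(packets: Sequence[bytes]) -> List[int]:
    """Crop or zero-pad a flow to 10 packets x 160 bytes = 1600 byte values."""
    vector: List[int] = []
    for i in range(PACKETS_PER_FLOW):
        # Flows with fewer than 10 packets are padded with empty packets.
        payload = packets[i] if i < len(packets) else b""
        cropped = payload[:BYTES_PER_PACKET]
        padded = cropped + bytes(BYTES_PER_PACKET - len(cropped))
        vector.extend(padded)  # iterating over bytes yields ints in 0..255
    return vector


def basic_flow_stats(packets: Sequence[bytes],
                     timestamps: Sequence[float]) -> Dict[str, float]:
    """Compute a few illustrative statistics on the *uncropped* flow."""
    return {
        "num_pkt": len(packets),
        "duration": timestamps[-1] - timestamps[0],
        "max_pkt_length": max(len(p) for p in packets),
    }
```

The statistics are computed before cropping, which is why they can carry global information (such as the true packet count) that the 1600-byte vector no longer contains.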

Convolutional neural network module
The convolutional neural network module is responsible for feature extraction from the raw network traffic data. The convolutional neural network requires an input of a specified size, so all flows are brought to the same size by cropping or by padding with zeros.
The convolutional neural network model designed is shown in Fig. 3. The process of feature extraction by the convolutional neural network in this paper is as follows.
First, the cropped raw traffic of fixed byte size is fed into the first convolutional layer. The input is $V = \{v_1, v_2, \ldots, v_n\}$, where n is the size of the cropped traffic; in this paper, 1600 bytes of each network flow are used as the model input. A one-dimensional convolution is then applied with a kernel of size 1 × h; the kernel of the first convolutional layer in this paper is 1 × 25. The kernel operates on a window of traffic bytes and outputs a new feature. The convolution operation is given by

$$ s_i = f\left(w \cdot x_{i:i+h} + b\right) $$

where b is the bias value, f is the ReLU nonlinear activation function, and w is the parameter that the model updates during training. The kernel slides over each flow window $\{x_i, \ldots, x_{i+h}\}$ and performs the convolution, ultimately producing an output feature map.
After the convolution operation, max pooling is applied to the feature vector $s$ to reduce the number of model training parameters. Max pooling keeps the largest feature value in each feature block as the value of that region; the pooling size chosen in this paper is 1 × 5 with a stride of 5. The max pooling operation is given by

$$ s'_j = \max\{ s_{5(j-1)+1}, \ldots, s_{5j} \} $$

The output vector $s'$ after pooling is one-fifth the length of the input vector $s$. It is then fed into the next convolutional layer and pooling layer for further feature extraction, and the resulting feature vector is finally passed to the fully-connected layers to extract the high-level feature $V_{pcap}$. The second convolutional layer and the second pooling layer have the same parameters as the first. They are followed by two fully-connected layers, the first with an output size of 1024 and the second with an output size of 600, and the resulting feature is combined with the compressed statistical features into a composite feature. The specific model parameters for the convolutional neural network module are shown in Table 1 below.
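To make the convolution and pooling steps concrete, the following is a minimal single-channel sketch in plain Python. It assumes no padding and a convolution stride of 1, neither of which is stated in the text, so the resulting lengths are illustrative rather than the values in Table 1. The function names are hypothetical.

```python
from typing import List


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def conv1d_relu(x: List[float], w: List[float], b: float) -> List[float]:
    """Slide kernel w over x (no padding, stride 1), add bias b, apply ReLU."""
    h = len(w)
    return [
        relu(sum(w[k] * x[i + k] for k in range(h)) + b)
        for i in range(len(x) - h + 1)
    ]


def max_pool1d(s: List[float], size: int = 5, stride: int = 5) -> List[float]:
    """Keep the largest value of each block of `size` consecutive features."""
    return [max(s[j:j + size]) for j in range(0, len(s) - size + 1, stride)]


def feature_length(n: int, kernel: int = 25, pool: int = 5, blocks: int = 2) -> int:
    """Length of the feature vector after `blocks` conv + pool stages."""
    for _ in range(blocks):
        n = n - kernel + 1          # convolution without padding
        n = (n - pool) // pool + 1  # pooling with stride equal to pool size
    return n
```

Under these assumptions a 1600-byte input shrinks to 1576 values after the first convolution, 315 after the first pooling, 291 after the second convolution, and 58 after the second pooling. With zero padding that preserves length, the sequence would instead be 1600, 320, 320, 64.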

AutoEncoder module
The AutoEncoder module is responsible for extracting information from statistical features: we extract statistical information from each network flow before performing traffic cropping. These statistics contain global information that is lost from the network streams due to cropping, such as "Max pkts length" (the maximum length of traffic packets in the flow), "Max payload" (the maximum payload size in the flow) and "Num pkt" (the number of packets in the flow). In this paper, 26 statistical features were selected, as shown in Table 2. To remove the effect of different magnitudes among the statistical features, this paper adopts max-min normalisation for data processing, as given by Eq. (6):

$$ x_{\mathrm{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{6} $$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the feature. After max-min normalisation, the value domain of the statistical features is [0, 1]. The AutoEncoder then compresses these 26 statistical features into a 10-dimensional vector space to obtain useful information from them; see Table 2 for details.
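As a small worked example of max-min normalisation, the sketch below scales each feature column of a set of flows to [0, 1]. The function name is hypothetical, and mapping a constant column to 0.0 is an assumption made here to avoid division by zero; the paper does not say how that case is handled.

```python
from typing import List


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Scale every feature (column) to [0, 1] across all flows (rows)."""
    n_features = len(rows[0])
    mins = [min(r[j] for r in rows) for j in range(n_features)]
    maxs = [max(r[j] for r in rows) for j in range(n_features)]
    normalized = []
    for r in rows:
        normalized.append([
            # Assumption: a constant feature carries no information -> 0.0
            (r[j] - mins[j]) / (maxs[j] - mins[j]) if maxs[j] > mins[j] else 0.0
            for j in range(n_features)
        ])
    return normalized
```

In practice the minimum and maximum would be computed on the training set only and reused for the test set, so that test data does not influence the scaling.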
The extracted statistical features are fed into an AutoEncoder for feature compression; the AutoEncoder structure used in this paper is shown in Fig. 4.
The input statistical features x are encoded and compressed by the encoder according to

$$ v = g_e(x) = \sigma\left(W_1 x + b_1\right) \tag{7} $$

where $g_e$ denotes the nonlinear encoding of the input data x by the encoder, $W_1$ denotes the parameters to be learned by the neural network in the encoder, $b_1$ is the bias value and $\sigma$ denotes the ReLU activation function.
After encoding, a compressed feature v is obtained. The compressed feature v is fed into the decoder for reconstruction according to

$$ x' = g_d(v) = \sigma\left(W_2 v + b_2\right) \tag{8} $$

where $g_d$ denotes the reconstruction of the compressed feature v by the decoder, $W_2$ denotes the parameters to be learned by the neural network in the decoder, $b_2$ is the bias value and $\sigma$ is the ReLU activation function.
The purpose of the AutoEncoder is to reduce the reconstruction error between the decoder's output x′ and the input data x by continuously optimising the parameters, which is generally done using Eq. (9).
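The encoder, decoder and reconstruction error can be sketched as follows. This is an illustration under two assumptions: each of the encoder and decoder is reduced to a single dense layer (matching the symbols W 1, b 1, W 2, b 2 above), and the reconstruction error is taken to be the mean squared error, since the exact form of Eq. (9) is not reproduced in this text. All function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Compute W x + b for a row-major weight matrix W."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def relu_vec(z: Vector) -> Vector:
    return [max(0.0, t) for t in z]


def encode(x: Vector, W1: Matrix, b1: Vector) -> Vector:
    """Compressed feature v = ReLU(W1 x + b1)."""
    return relu_vec(affine(W1, x, b1))


def decode(v: Vector, W2: Matrix, b2: Vector) -> Vector:
    """Reconstruction x' = ReLU(W2 v + b2)."""
    return relu_vec(affine(W2, v, b2))


def reconstruction_error(x: Vector, x_rec: Vector) -> float:
    """Assumed loss: mean squared error between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)
```

With the dimensions given in the text, W1 would have shape 10 × 26 and W2 shape 26 × 10, so `encode` maps the 26 normalised statistics to the 10-dimensional compressed feature that is later fused with the CNN output. Training would adjust the weights to minimise `reconstruction_error`; only the forward pass is shown here.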

Classification modules
The high-level features extracted by the convolutional neural network are first concatenated with the features compressed by the AutoEncoder. The original convolutional neural network input has 1600 dimensions and the original statistical features have 26 dimensions, a ratio of approximately 60:1, so the extracted features are kept in the same ratio (600 and 10 dimensions, respectively). The combined features are then fed into the classification module. The specific parameters of the neural network for the classification module are shown in Table 4.
The output size of the last layer depends on the classification task: binary classification and octet classification were applied to the data. The softmax activation function is then applied to the output, as shown in Eq. (10):

$$ p(c_i \mid n) = \frac{e^{z_i}}{\sum_{j=1}^{r} e^{z_j}} \tag{10} $$

where $p(c_i \mid n)$ denotes the probability of sample n under category $c_i$, $z_i$ denotes the output of the last layer for category $c_i$, and r denotes the number of categories. The predicted category of the sample is the one with the highest probability.
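The fusion and final classification step can be illustrated with a short sketch: the 600-dimensional CNN feature and the 10-dimensional AutoEncoder feature are concatenated, and a softmax over the last-layer outputs selects the predicted class. The function names are hypothetical, and the intermediate layers of the classification module (Table 4) are omitted.

```python
import math
from typing import List, Sequence


def fuse(v_pcap: Sequence[float], v_stat: Sequence[float]) -> List[float]:
    """Concatenate CNN features and compressed statistical features."""
    return list(v_pcap) + list(v_stat)


def softmax(z: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(z)
    exps = [math.exp(t - m) for t in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(z: Sequence[float]) -> int:
    """Index of the category with the highest softmax probability."""
    p = softmax(z)
    return max(range(len(p)), key=p.__getitem__)
```

For binary classification the last layer would produce two outputs and for octet classification eight; `predict` works unchanged in both cases.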

Algorithmic process of multi-information fusion model based on CNN and AutoEncoder
The optimal network parameters saved during model training are used to classify the data under test; Fig. 5 shows the flowchart of the algorithm. The original traffic data is first cropped according to the pre-processing method so that all flows have the same size, and the statistical features of the original traffic data are extracted at the same time. The model weights are then initialized. The cropped traffic is input into the CNN module, where features are extracted using Eqs. (3) to (5), and the statistical features are input into the AutoEncoder module, where they are compressed using Eqs. (7) to (9). The model is trained and its parameters are adjusted until the end condition is met, and the optimal network parameters are saved. Finally, the test data is input into the trained model to determine the class to which it belongs.

Description of experimental data
This paper utilizes the commonly used CICIDS2017 dataset [29]. As a widely accepted benchmark, CICIDS2017 demonstrates good applicability [30] and has been extensively employed in research on various network anomaly detection methods [31]. While the dataset providers offer both the original pcap files and a pre-processed version containing 78 statistical features extracted using the CICFlowMeter tool, this paper focuses on processing the original pcap file data. The CICIDS2017 dataset comprises network traffic collected by simulating a real-world attack scenario, spanning five days from Monday, July 3, 2017, to Friday, July 7, 2017. Monday's data consists solely of normal network traffic, while the data from Tuesday to Friday includes instances of DDoS, brute force FTP, and botnet attacks. The network traffic is labeled based on the five-tuple information of each flow. Normal traffic as well as 10 types of attack traffic were extracted from the original dataset as the test and training sets for the model under study. The data pre-processing process is described in detail. The final labels of the extracted data and the corresponding quantities are shown in Table 5. From Table 5 it can be seen that the samples labelled Normal and Port Scan far outnumber the other samples. To prevent imbalance in classification accuracy due to too many samples with these two labels, these two types of samples are randomly undersampled during the octet classification experiments, with 10,000 samples retained for each type. In contrast, no undersampling was performed for the binary classification.
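The random undersampling of the two majority classes can be sketched as below. The function name, the label strings and the fixed seed are illustrative assumptions; the only detail taken from the text is the cap of 10,000 samples for each majority label.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Set, Tuple


def undersample(samples: Sequence, labels: Sequence[Hashable],
                majority_labels: Set[Hashable], cap: int = 10_000,
                seed: int = 0) -> Tuple[List, List]:
    """Randomly keep at most `cap` samples of each majority label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    keep: List[int] = []
    for label, idxs in by_label.items():
        if label in majority_labels and len(idxs) > cap:
            idxs = rng.sample(idxs, cap)  # sampling without replacement
        keep.extend(idxs)

    keep.sort()  # preserve the original ordering of the retained samples
    return [samples[i] for i in keep], [labels[i] for i in keep]
```

A call such as `undersample(X, y, {"Normal", "Port Scan"})` would leave all minority classes untouched while limiting the two large classes to 10,000 flows each.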
The original traffic was pre-processed and turned into 1600-dimensional data. For visualisation, each 1600-dimensional feature vector was transformed into a 40 × 40 matrix, and Fig. 6 shows the grey-scale maps of the eight types of samples. From Fig. 6 it can be seen that there are clear differences between some classes, for example the Normal and Botnet labels, where the texture of the two samples is easily distinguished. Other classes are more similar, for example the DDoS and FTP-Patator labels, where there are only subtle differences between them. In contrast, samples with the same label all have a similar distribution, for example those with the Portscan and Web Attack labels. In summary, there is a large difference between samples from different labels, while samples from the same label have relatively similar textures.
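The reshaping used for visualisation is a simple row-major fold of the 1600 byte values into 40 rows of 40. A minimal sketch, with a hypothetical function name and assuming row-major order (the paper does not state the ordering):

```python
from typing import List, Sequence


def to_grayscale_matrix(vector: Sequence[int], side: int = 40) -> List[List[int]]:
    """Fold a flat vector of byte values (0..255) into a side x side matrix."""
    if len(vector) != side * side:
        raise ValueError(f"expected {side * side} values, got {len(vector)}")
    return [list(vector[r * side:(r + 1) * side]) for r in range(side)]
```

Because each packet contributes 160 bytes, every packet occupies exactly four consecutive rows of the image, so zero-padded packets appear as black horizontal bands.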

Analysis of experimental results
The models are evaluated using accuracy, precision, recall and F1 score. A convolutional neural network model using only raw traffic data, AlexNet, ResNet, and some classical machine learning models were selected for the binary classification and octet classification experiments. To ensure that samples of every category are distributed in the same proportion between the training and test sets, 70% of the data from each label is selected as the training set and 30% as the test set. The coding table for the classification labels is shown in Table 6.
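The per-label 70/30 split corresponds to a stratified split, which can be sketched as follows. The function name and the seed are illustrative; rounding of the cut point is an assumption for labels whose count is not a multiple of ten.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(labels: Sequence[Hashable], train_fraction: float = 0.7,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), splitting each label separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train: List[int] = []
    test: List[int] = []
    for idxs in by_label.values():
        shuffled = idxs[:]
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return sorted(train), sorted(test)
```

Splitting each label separately guarantees that rare attack classes appear in both sets, which a plain random split over all flows would not.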

Determination of model parameters
Because the performance of the model is largely determined by the parameters of the neural network, some parameters of the convolutional neural network were tuned experimentally. First, the number of convolutional channels was analysed to find the optimal value. The number of convolutional channels in the first layer was set to 16, 32, 64 and 128 in turn, keeping other parameters constant, to compare the effect of the number of channels on the classification accuracy of the model.
The experimental results are shown in Fig. 7. From Fig. 7, it can be seen that in binary classification, the classification accuracy gradually increases as the number of convolutional channels goes from 16 to 32 to 64, and starts to decrease at 128 channels. At 64 channels, however, the accuracy increases by only 0.01% while requiring more training time. In octet classification, the accuracy already starts to decrease at 64 channels. Therefore, considering both performance and cost, the proposed model uses 32 channels in the first layer.
Second, to compare the effect of the number of convolutional layers on the performance of the model, we varied the number of convolutional layers from one to four and compared accuracy, keeping other parameters constant. The experimental results are shown in Table 7.
As can be seen from Table 7, the model with one convolutional layer performs the worst. In binary classification, the three-layer model performs slightly better than the two-layer model, and the accuracy starts to drop when the number of layers reaches four; in multi-class classification, the accuracy already starts to drop at three convolutional layers. Although the three-layer model is somewhat better in binary classification, deeper networks mean more parameters and higher learning costs. Considering both performance and cost, the two-layer convolutional neural network was finally chosen in this paper. Finally, to explore the effect of 1D and 2D convolution on model performance, we also tested the difference in classification accuracy in octet classification for different numbers of 1D and 2D convolutional layers.
From Table 8, it can be found that the classification accuracy of the model with two-dimensional convolution at any number of layers is lower than that of the one-dimensional convolutional neural network model.

Binary classification experiments
The classification results on the CICIDS2017 dataset for deep learning models such as CNN and classical machine learning models such as Random Forest are presented in Table 9. The CNN1D model is the CNN applied in this paper used on its own: it employs one-dimensional convolution to extract features directly from the raw traffic, without combining the statistical features compressed by the AutoEncoder. As can be seen from Table 9, the deep learning models generally outperformed the traditional machine learning models in binary classification on the CICIDS2017 dataset, all achieving an accuracy of over 99%, and the model proposed in this paper performed the best of all models, reaching the highest values in all four metrics. The SVM had the worst performance among the models.

Multi-classification experiments
To further validate the performance of the model, octet classification experiments were conducted on the CICIDS2017 dataset; the results are presented in Table 10. From Table 10, it can be seen that the three deep learning models, AlexNet, ResNet and CNN1D, still outperformed the traditional machine learning models in octet classification, while among the traditional machine learning models, KNN had the worst performance. The proposed model achieved 98.51% accuracy, 98.31% precision and 98.31% F1 score, the highest values among all models.
To show how well the model predicted the samples of each class, we plotted the confusion matrix shown in Fig. 8. As can be seen in Fig. 8, most of the samples were classified correctly. The majority of the incorrectly predicted samples of label 4 were predicted as label 5. As can be seen from the grey-scale plot, there are indeed many similarities between the samples of these two labels, which makes them more difficult for the model to separate and may lead to misclassification. To compare the classification performance of the models on different types, we plotted the F1 scores of the models for each category. As shown in Fig. 9, the proposed model achieves the best F1 scores on most labels, and its overall performance is the best. To facilitate the analysis of the test performance of the proposed algorithm, additional statistical analyses were conducted. The standard deviation and mean accuracy are presented in Figs. 10 and 11, respectively. The proposed algorithm achieves the smallest standard deviation and the highest average accuracy, indicating its superior robustness and generalization ability.
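The per-category precision, recall and F1 scores discussed above can all be derived from a confusion matrix. The sketch below assumes rows are true labels and columns are predicted labels; the paper does not state the orientation of Fig. 8, so this convention is an assumption, as is the function name.

```python
from typing import List, Tuple


def per_class_metrics(cm: List[List[int]]) -> Tuple[float, List[Tuple[float, float, float]]]:
    """Return overall accuracy and (precision, recall, F1) for each class.

    cm[i][j] is the number of samples with true label i predicted as label j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total

    metrics = []
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp  # predicted i, true label differs
        fn = sum(cm[i]) - tp                       # true i, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return accuracy, metrics
```

Under this convention, the confusion between labels 4 and 5 noted above would show up as a large off-diagonal entry `cm[4][5]`, lowering the recall of label 4 and the precision of label 5.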

Analysis and summary
To facilitate a deeper understanding of the model's architecture, functionality, implementation details, and relevant characteristics, the following section will provide further analysis and discussion.
Firstly, regarding the model's structural design. In this study, the model's input data comes from the CICIDS2017 dataset. The raw data is stored in pcap format, recording a series of raw byte streams of network packets. By parsing these pcap files, relevant information about network flows was extracted, and on this basis the traffic data was segmented into individual flows for model feature processing. Feature engineering is a critical step to ensure model performance. Initially, the raw pcap data was cleaned by removing unnecessary addresses and fields. Then, the data for each flow was cropped to ensure all input data had a uniform dimension. To further extract useful global information, an AutoEncoder was used to compress the statistical features. Finally, the high-level features extracted by the convolutional neural network are combined with the AutoEncoder-compressed statistical features to form comprehensive features, capturing more extensive traffic information. During the comprehensive evaluation of the model, particular attention was paid to its accuracy, generalization ability, and robustness. To measure these, accuracy, precision, recall, and F1 score were used as the primary evaluation standards. These metrics not only reflect the model's performance on a specific dataset but also provide a reference for its potential performance in practical applications.
The experimental results demonstrate that the model achieved high accuracy in various classification tasks. These results not only demonstrate the model's good accuracy on the current dataset but also indicate that the model has excellent generalization ability and can adapt to different network traffic patterns. Meanwhile, the model's generalization ability was also validated in tests involving different types of network attack scenarios. The model performed well not only on known attack types but also maintained high detection accuracy when faced with unknown attack types, further proving the model's good generalization performance. Additionally, through the analysis of the confusion matrix, the model's performance in various categories was further observed, which helps in understanding the model's decision-making process. In terms of evaluating the model's robustness, particular attention was given to its performance when faced with noise and outliers in the dataset. Testing with complex environmental data revealed that the model maintained high performance even in the presence of incomplete data or noise, demonstrating its robustness. The stability of key performance metrics such as accuracy, precision, recall, and F1 score under different testing conditions further highlights the model's robustness in the face of different attack patterns and data perturbations. Combining these evaluation results provides a comprehensive understanding of the model's performance. The model not only performs well on the current dataset but also has the potential for application in a broader network environment.

Feedback and decision-making. The model's effectiveness is reflected not only in experimental results but also in its positive impact on real network environments. If the model undergoes further validation and is accepted, plans involve integrating its feedback loop into actual network environments to assist network administrators and security analysts in making more informed decisions. The integration process will include the following key steps: First, the model will be deployed in network traffic monitoring systems to analyze incoming data packets in real time and identify potential anomalous behaviors. Second, once an anomaly is detected, the model will provide detailed alert information, including the type of anomaly, its severity, and possible impacts. A good feedback mechanism can not only help users identify and respond to current security threats but also effectively enhance the overall security of the network environment.
Secondly, regarding parameter optimization. To determine the optimal hyperparameters for the convolutional neural network (CNN) and AutoEncoder models, this paper employed Bayesian optimization. Bayesian optimization is an efficient global optimization strategy that guides the search process by constructing a probabilistic model of the hyperparameters, thereby finding the optimal solution with fewer evaluations. This paper defined a prior distribution to express the initial uncertainty about the hyperparameters and updated the posterior distribution based on the results of each model evaluation. Through this method, it was possible to gradually narrow down the search range and ultimately determine the hyperparameter combination that maximizes model performance. The specific implementation included the following steps: First, learning rate, number of network layers, and number of hidden units were selected as the target hyperparameters for optimization. Then, a search space was defined for each hyperparameter, and Gaussian processes were used to model the relationship between hyperparameters and model performance. In each iteration, the most promising hyperparameter combination was selected for evaluation based on the posterior distribution, and the model was updated. By repeating this process, the hyperparameter settings that performed best on the validation set were ultimately found. Bayesian optimization was chosen for its efficiency and effectiveness in handling costly function evaluations, which is particularly important for hyperparameter tuning of deep learning models. Through Bayesian optimization, not only was the efficiency of the model tuning process improved, but the model's performance in anomaly detection tasks was also ensured.
Furthermore, this paper adopts several measures to reduce the risk of overfitting and ensure model performance. First, data preprocessing and normalization steps are implemented to clean the data and reduce noise interference. Then, L2 regularization is introduced to control model complexity, limiting the model's tendency to overfit. Additionally, early stopping is applied during training, terminating the process when validation set performance stops improving, which prevents the model from overfitting the training data. Hyperparameter tuning with Bayesian optimization is also used to select parameter combinations that avoid overly complex models fitting noise in the data, improving generalization performance while reducing overfitting risk. Moreover, the model's performance metrics on both the training and validation sets are continuously monitored to ensure balanced model behavior. By combining these measures, the risk of overfitting is effectively reduced, ensuring that the model demonstrates better detection performance when dealing with complex network traffic.
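The early-stopping rule mentioned above can be illustrated with a minimal sketch that operates on a recorded sequence of validation losses. The `patience` and `min_delta` parameters and the function name are assumptions chosen for illustration; the paper does not report the values it used.

```python
from typing import Sequence, Tuple


def early_stopping_epoch(val_losses: Sequence[float], patience: int = 3,
                         min_delta: float = 0.0) -> Tuple[int, int]:
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs; the weights saved at
    `best_epoch` would then be restored.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience never exhausted: training ran for all recorded epochs.
    return len(val_losses) - 1, best_epoch
```

For example, with losses `[1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.5]` and a patience of 3, training would stop at epoch 5 and the parameters from epoch 2 would be kept; the later improvement at epoch 6 is never reached, which is the usual trade-off of a small patience value.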
Finally, regarding the applicability and foresight of the model. In the face of the dynamic nature of the cybersecurity environment, it is crucial for a model to possess foresight and adaptability in handling evolving network threats. This paper has specifically considered these aspects in the model design. The following are the key features of the model that enable it to adapt to new threats. Multi-feature fusion strategy: The model employs a multi-feature fusion strategy, extracting both local features from raw traffic data and global information from statistical features. This fusion provides a comprehensive understanding and deep insight into network traffic, significantly enhancing the ability to identify emerging and unknown anomalous patterns. The integration of

Conclusion
To avoid the loss of traffic data information in the process of network anomaly detection and to improve the detection performance of traffic feature information, a multi-information fusion model based on CNN and AE is proposed. The method combines the load information contained in the raw traffic data with the global information contained in the statistical features, and fuses them to obtain more complete and comprehensive information, thus allowing better feature representation and enhancing the ability to identify network traffic anomalies. The performance of the model was tested on the CICIDS2017 dataset. The results show that the method proposed in this paper achieves 99.35% accuracy in binary classification and 98.51% accuracy in octet classification, which is higher than deep learning models using only raw traffic data and classical machine learning models.

Figure 3. The structure of the CNN.

Figure 5. Network traffic anomaly detection algorithm flow chart.

Figure 6. Visualization of 8 types of samples in CICIDS2017 dataset.

Figure 8. Confusion matrix of the CICIDS2017 dataset.

Figure 9. Comparison of F1 scores in each category of each model.

Table 1. Specific parameters of convolutional neural network.

Table 2. 26 statistical features.

Two fully connected neural networks are used for both the encoder and the decoder, and the specific parameters of the encoder and decoder are listed in Table 3 below.

Table 3. Parameters of the AutoEncoder.

Table 4. Neural network parameters of the classification module.

Table 5. CICIDS2017 label classification and quantity distribution.

Table 7. Classification results of different convolutional layers.

Table 8. Comparison of octet classification performance between different dimensional convolutions.

Table 9. The results of the binary classification experiment of each model on CICIDS2017.

Table 10. Comparison of the octet classification experiment results of each model on the CICIDS2017 dataset.